Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you can use R Markdown syntax.
This course intrigues me and I expect it to be very useful for my following research. The content of the course is very interesting that is why I have registered for the course. The course has very friendly and open environment.
Describe the work you have done this week and summarize your learning.
1.This is the data from the international survey of Approaches to Learning. It icludes the attitute toward statistics, exam points, age, gender and anwers to the guestions which were combined in the theree variables as deep,surface and strategic questions. Also the observations where the exam points variable is zero were excluded.The data included 166 observations and 7 variables.
The plot illustrates the normality of the errors disributions.The normal distributions implies that the majority of the obserbations are located close to the regression line. Here only a part of the observations are close what is why normality is questionable. The plot illustrates the relationships between the variables.For example, we can see a negative correlation between points and age, surf and stra, points and surf and etc. There are strong correlations between points and age, age and attitude, age and deep, attitude and deep, age and stra, attitude and stra, deep and stra and etc.
I have created a model with the dependent variable points and independent variables attitude, age and stra. The attitude variable has a high significance (p-value).It has 3.480 estimated coefficient with the positive relations with the dependent variable (when the value of the independent variable rises dependent also rises). Age and stra veriables are significant on 0.05 level. Age has a negative etsimate that means that when it rises the points fall. Srta has a positive relation with the points.
The R-square is not very big since the model predicts around 20% of the points.F statistics compares the model with the model with only intercept. Here it has significant p-value which means that my model predicts better than a model with no predictors.
5.Residuals vs Fitted values plot checks the assumption that size on errors should not depend on the explanotary variables.Here only a small part of the obseravations do not fit the pattern that is why there is no problem with the assumption. Normal QQ-plot shows that the erros of the model are normally distributed since points mostly fall on regression line. Residuals vs Leverage diahnostic measures the impact of the single observation to the model. In the plot we can see several observations with the higher leverage which are located too below the line (-4-2) when the line is around 0.
chooseCRANmirror(graphics=FALSE, ind=1)
install.packages("lmtest", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/8t/x1h4shkx7ms9f1___g0qnkhr0000gn/T//Rtmpfw10F3/downloaded_packages
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
learning2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
learning2014
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
## 7 M 50 3.5 3.833333 2.250 1.916667 21
## 8 F 37 2.9 3.250000 4.000 2.833333 31
## 9 M 37 3.8 4.333333 4.250 2.166667 24
## 10 F 42 2.1 4.000000 3.500 3.000000 26
## 11 M 37 3.9 3.583333 3.625 2.666667 31
## 12 F 34 3.8 3.833333 4.750 2.416667 31
## 13 F 34 2.4 4.250000 3.625 2.250000 23
## 14 F 34 3.0 3.333333 3.500 2.750000 25
## 15 M 35 2.6 4.166667 1.750 2.333333 21
## 16 F 33 4.1 3.666667 3.875 2.333333 31
## 17 F 32 2.6 4.083333 1.375 2.916667 20
## 18 F 44 2.6 3.500000 3.250 2.500000 22
## 19 M 29 1.7 4.083333 3.000 3.750000 9
## 20 F 30 2.7 4.000000 3.750 2.750000 24
## 21 M 27 3.9 3.916667 2.625 2.333333 28
## 22 M 29 3.4 4.000000 2.375 2.416667 30
## 23 F 31 2.7 4.000000 3.625 3.000000 24
## 24 F 37 2.3 3.666667 2.750 2.416667 9
## 25 F 26 3.7 3.666667 1.750 2.833333 26
## 26 F 26 4.4 4.416667 3.250 3.166667 32
## 27 M 30 4.1 3.916667 4.000 3.000000 32
## 28 F 33 3.7 3.750000 3.625 2.000000 33
## 29 F 33 2.5 3.250000 2.875 3.500000 29
## 30 M 28 3.0 3.583333 3.000 3.750000 30
## 31 M 26 3.4 4.916667 1.625 2.500000 19
## 32 F 27 3.2 3.583333 3.250 2.083333 23
## 33 F 25 2.0 2.916667 3.500 2.416667 19
## 34 F 31 2.4 3.666667 3.000 2.583333 12
## 35 M 20 4.2 4.500000 3.250 1.583333 10
## 36 F 39 1.6 4.083333 1.875 2.833333 11
## 37 M 38 3.1 3.833333 4.375 1.833333 20
## 38 M 24 3.8 3.250000 3.625 2.416667 26
## 39 M 26 3.8 2.333333 2.500 3.250000 31
## 40 M 25 3.3 3.333333 1.250 3.416667 20
## 41 F 30 1.7 4.083333 4.000 3.416667 23
## 42 F 25 2.5 2.916667 3.000 3.166667 12
## 43 M 30 3.2 3.333333 2.500 3.500000 24
## 44 F 48 3.5 3.833333 4.875 2.666667 17
## 45 F 24 3.2 3.666667 5.000 2.416667 29
## 46 F 40 4.2 4.666667 4.375 3.583333 23
## 47 M 25 3.1 3.750000 3.250 2.083333 28
## 48 F 23 3.9 3.416667 4.000 3.750000 31
## 49 F 25 1.9 4.166667 3.125 2.916667 23
## 50 F 23 2.1 2.916667 2.500 2.916667 25
## 51 M 27 2.5 4.166667 3.125 2.416667 18
## 52 M 25 3.2 3.583333 3.250 3.000000 19
## 53 M 23 3.2 2.833333 2.125 3.416667 22
## 54 F 23 2.6 4.000000 2.750 2.916667 25
## 55 F 23 2.3 2.916667 2.375 3.250000 21
## 56 F 45 3.8 3.000000 3.125 3.250000 9
## 57 F 22 2.8 4.083333 4.000 2.333333 28
## 58 F 23 3.3 2.916667 4.000 3.250000 25
## 59 M 21 4.8 3.500000 2.250 2.500000 29
## 60 M 21 4.0 4.333333 3.250 1.750000 33
## 61 F 21 4.0 4.250000 3.625 2.250000 33
## 62 F 21 4.7 3.416667 3.625 2.083333 25
## 63 F 26 2.3 3.083333 2.500 2.833333 18
## 64 F 25 3.1 4.583333 1.875 2.833333 22
## 65 F 26 2.7 3.416667 2.000 2.416667 17
## 66 M 21 4.1 3.416667 1.875 2.250000 25
## 67 F 23 3.4 3.416667 4.000 2.833333 28
## 68 F 22 2.5 3.583333 2.875 2.250000 22
## 69 F 22 2.1 1.583333 3.875 1.833333 26
## 70 F 22 1.4 3.333333 2.500 2.916667 11
## 71 F 23 1.9 4.333333 2.750 2.916667 29
## 72 M 22 3.7 4.416667 4.500 2.083333 22
## 73 M 23 3.2 4.833333 3.375 2.333333 21
## 74 M 24 2.8 3.083333 2.625 2.416667 28
## 75 F 22 4.1 3.000000 4.125 2.750000 33
## 76 F 23 2.5 4.083333 2.625 3.250000 16
## 77 M 22 2.8 4.083333 2.250 1.750000 31
## 78 M 20 3.8 3.750000 2.750 2.583333 22
## 79 M 22 3.1 3.083333 3.000 3.333333 31
## 80 M 21 3.5 4.750000 1.625 2.833333 23
## 81 F 22 3.6 4.250000 1.875 2.500000 26
## 82 F 23 2.6 4.166667 3.375 2.416667 12
## 83 M 21 4.4 4.416667 3.750 2.416667 26
## 84 M 22 4.5 3.833333 2.125 2.583333 31
## 85 M 29 3.2 3.333333 2.375 3.000000 19
## 86 F 29 3.9 3.166667 2.750 2.000000 30
## 87 F 21 2.5 3.166667 3.125 3.416667 12
## 88 M 28 3.3 3.833333 3.500 2.833333 17
## 89 F 21 3.3 4.250000 2.625 2.250000 18
## 90 F 30 3.0 3.833333 3.375 2.750000 19
## 91 F 21 2.9 3.666667 2.250 3.916667 21
## 92 M 23 3.3 3.833333 3.000 2.333333 24
## 93 F 21 3.3 3.833333 4.000 2.750000 28
## 94 F 21 3.5 3.833333 3.500 2.750000 17
## 95 F 20 3.6 3.666667 2.625 2.916667 18
## 96 M 22 3.7 4.333333 2.500 2.083333 17
## 97 M 21 4.2 3.750000 3.750 3.666667 23
## 98 M 21 3.2 4.166667 3.625 2.833333 26
## 99 F 20 5.0 4.000000 4.125 3.416667 28
## 100 M 22 4.7 4.000000 4.375 1.583333 31
## 101 F 20 3.6 4.583333 2.625 2.916667 27
## 102 F 20 3.6 3.666667 4.000 3.000000 25
## 103 M 24 2.9 3.666667 2.750 2.916667 23
## 104 F 20 3.5 3.833333 2.750 2.666667 21
## 105 F 19 4.0 2.583333 1.375 3.000000 27
## 106 F 21 3.5 3.500000 2.250 2.750000 28
## 107 F 21 3.2 3.083333 3.625 3.083333 23
## 108 F 22 2.6 4.250000 3.750 2.500000 21
## 109 F 25 2.0 3.166667 4.000 2.333333 25
## 110 F 21 2.7 3.083333 3.125 3.000000 11
## 111 F 22 3.2 4.166667 3.250 3.000000 19
## 112 F 25 3.3 2.250000 2.125 4.000000 24
## 113 F 20 3.9 3.333333 2.875 3.250000 28
## 114 M 24 3.3 3.083333 1.500 3.500000 21
## 115 F 20 3.0 2.750000 2.500 3.500000 24
## 116 M 21 3.7 3.250000 3.250 3.833333 24
## 117 F 20 2.5 4.000000 3.625 2.916667 20
## 118 F 20 2.9 3.583333 3.875 2.166667 19
## 119 M 31 3.9 4.083333 3.875 1.666667 30
## 120 F 20 3.6 4.250000 2.375 2.083333 22
## 121 F 22 2.9 3.416667 3.000 2.833333 16
## 122 F 22 2.1 3.083333 3.375 3.416667 16
## 123 M 21 3.1 3.500000 2.750 3.333333 19
## 124 M 22 4.0 3.666667 4.500 2.583333 30
## 125 F 21 3.1 4.250000 2.625 2.833333 23
## 126 F 21 2.3 4.250000 2.750 3.333333 19
## 127 F 21 2.8 3.833333 3.250 3.000000 18
## 128 F 21 3.7 4.416667 4.125 2.583333 28
## 129 F 20 2.6 3.500000 3.375 2.416667 21
## 130 F 21 2.4 3.583333 2.750 3.583333 19
## 131 F 25 3.0 3.666667 4.125 2.083333 27
## 132 M 21 2.8 2.083333 3.250 4.333333 24
## 133 F 24 2.9 4.250000 2.875 2.666667 21
## 134 F 20 2.4 3.583333 2.875 3.000000 20
## 135 M 21 3.1 4.000000 2.375 2.666667 28
## 136 F 20 1.9 3.333333 3.875 2.166667 12
## 137 F 20 2.0 3.500000 2.125 2.666667 21
## 138 F 18 3.8 3.166667 4.000 2.250000 28
## 139 F 21 3.4 3.583333 3.250 2.666667 31
## 140 F 19 3.7 3.416667 2.625 3.333333 18
## 141 F 21 2.9 4.250000 2.750 3.500000 25
## 142 F 20 2.3 3.250000 4.000 2.750000 19
## 143 M 21 4.1 4.416667 3.000 2.000000 21
## 144 F 20 2.7 3.250000 3.375 2.833333 16
## 145 F 21 3.5 3.916667 3.875 3.500000 7
## 146 F 20 3.4 3.583333 3.250 2.500000 21
## 147 F 18 3.2 4.500000 3.375 3.166667 17
## 148 M 22 3.3 3.583333 4.125 3.083333 22
## 149 F 22 3.3 3.666667 3.500 2.916667 18
## 150 M 24 3.5 2.583333 2.000 3.166667 25
## 151 F 19 3.2 4.166667 3.625 2.500000 24
## 152 F 20 3.1 3.250000 3.375 3.833333 23
## 153 F 20 2.8 4.333333 2.125 2.250000 23
## 154 F 17 1.7 3.916667 4.625 3.416667 26
## 155 M 19 1.9 2.666667 2.500 3.750000 12
## 156 F 20 3.5 3.083333 2.875 3.000000 32
## 157 F 20 2.4 3.750000 2.750 2.583333 22
## 158 F 20 2.1 4.166667 4.000 3.333333 20
## 159 F 20 2.9 4.166667 2.375 2.833333 21
## 160 F 19 1.9 3.250000 3.875 3.000000 23
## 161 F 19 2.0 4.083333 3.375 2.833333 20
## 162 F 22 4.2 2.916667 1.750 3.166667 28
## 163 M 35 4.1 3.833333 3.000 2.750000 31
## 164 F 18 3.7 3.166667 2.625 3.416667 18
## 165 F 19 3.6 3.416667 2.625 3.000000 30
## 166 M 21 1.8 4.083333 3.375 2.666667 19
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
library(ggplot2)
#initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points))
#define the visualization type (points)
p2 <- p1 + geom_point()
#add a regression line
p3 <- p2 + geom_smooth(method = "lm")
#add a main title and draw the plot
p4 <- p3 + ggtitle ("Student's attitude versus exam points")
p4
#draw a scatter plot matrix of the variables in learning2014.
#[-1] excludes the first column (gender)
pairs(learning2014[-1])
#access the GGally and ggplot2 libraries
install.packages("GGally")
##
## The downloaded binary packages are in
## /var/folders/8t/x1h4shkx7ms9f1___g0qnkhr0000gn/T//Rtmpfw10F3/downloaded_packages
install.packages("ggplot2")
##
## The downloaded binary packages are in
## /var/folders/8t/x1h4shkx7ms9f1___g0qnkhr0000gn/T//Rtmpfw10F3/downloaded_packages
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p
# fit a linear model
my_model <- lm(points ~ 1, data = learning2014)
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
qplot(attitude, age, data = learning2014) + geom_smooth(method = "lm")
qplot(attitude, stra, data = learning2014) + geom_smooth(method = "lm")
# create an plot matrix with ggpairs()
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
#create a regression model with 3 multiple explanatory variables
my_model1 <- lm(points ~ attitude + age + stra, data = learning2014)
summary(my_model1)
##
## Call:
## lm(formula = points ~ attitude + age + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## age -0.08822 0.05302 -1.664 0.0981 .
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
par(mfrow = c(2,2))
plot(my_model1)
alc <- read.csv( file="data/alc.csv", sep=",")
dim(alc)
## [1] 382 35
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
This is a combined data from two original datasets on student achievement in secondary education of two Portuguese schools. It was merged to include only the students who answered the questionnaire in both math and portuguese classes.New column high use was created when’alc_use’ is greater than 2. Data has 382 observations of 35 variables. I choose the 4 following variables to test their relationships with high alcohol consumption. 1. Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart. I suppose that living apart with the students leads to high alcohol consumption.
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent). I suppose that bad family relationships result on high alcohol consumption.
goout - going out with friends (numeric: from 1 - very low to 5 - very high). I suppose that student who often go out with friends consume more alcohol.
romantic - with a romantic relationship (binary: yes or no). I suppose that students who are not in relationships consume more alcohol.
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 1…
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3…
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T,…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3,…
## $ Mjob <fct> at_home, at_home, at_home, health, other, services, o…
## $ Fjob <fct> teacher, other, other, services, other, other, other,…
## $ reason <fct> course, course, other, home, home, reputation, home, …
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, y…
## $ guardian <fct> mother, father, mother, mother, father, mother, mothe…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no…
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, y…
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, y…
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4,…
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 1…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, …
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
#use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc) %>% glimpse
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Observations: 13,370
## Variables: 2
## $ key <chr> "school", "school", "school", "school", "school", "school"…
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
#draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
Bar plots illustrate the following distribution of the variables: Pstatus: the majotiry of students have parents living together romantic: twice more students are no in romantic relations famrel: the majority of the respondents have very good relations with parents,less have excellent and good relations while thre are students who have bad and very bad relations goout: has a normal distribution.
library(dplyr); library(ggplot2)
# produce summary statistics by group
alc %>% group_by(sex, high_use) %>% summarise(count = mean(G3))
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex high_use count
## <fct> <lgl> <dbl>
## 1 F FALSE 11.4
## 2 F TRUE 11.7
## 3 M FALSE 12.2
## 4 M TRUE 10.3
library(ggplot2)
#initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3))
#define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade")
#initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences))
#define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("Student absences by alcohol consumption and sex")
#my boxplots
g3 <- ggplot(alc, aes(x = high_use, y = goout))
g3 + geom_boxplot() + ylab("Go out by alcohol consumption")
g4 <- ggplot(alc, aes(x = high_use, y = famrel))
g4 + geom_boxplot() + ylab("Family relations by alcohol consumption")
First here is the boxplot from the exercise which shows the relations between high use of alcohol and final grades. Here we seen a lot of outliners with the low grades and several with high grades who high consume alcohol. Next there are boxplots on absences. Both boxpots are located quite low which shows the low level of abcences and true assumtion on higher alcohol use. However there are a lot of outliners who do not miss school and consume more as well as go to school and do not consume.
I have constructed the boxplot on goout and high use of alcohol. The true has a higher probability with the minimum 3 and max 5 which is often.The firts boxplot shows that 75% observanions are located low from 1 to 3 (students who go out rare and for them high use of alcohol is false). However there is one outliner which goes out often (5) and does not drink. -> I have support for my hypothesis about go out and high alcohol.
I have constructed the boxplot on family relations and high use of alcohol. In the fist plot we see that the asumption is false for people who have good (3) and excellent relations with parents (5) which coincides with my assumption + we have outliners who have bad relations with parents. The second boxplot illustrates that the assumtion is true for people who have bad and good relations with one outliner who has bad relations but does not consume much -> I have support for my hypothesis about family relations and high alcohol.
I also have tried to construct boxpolts for variables Pstatus and romantic but they look very stange. -> I suppose there are not significant relations between Pstatus and romantic relations with high alcohol consumption but let’s test it anyway in regression.
#find the model with glm()
m <- glm(high_use ~ famrel + goout+romantic+Pstatus, data = alc, family = "binomial")
#print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + goout + romantic + Pstatus,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5849 -0.7766 -0.5662 0.9554 2.4378
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8376 0.7049 -2.607 0.00914 **
## famrel -0.3732 0.1340 -2.784 0.00537 **
## goout 0.7959 0.1185 6.718 1.84e-11 ***
## romanticyes -0.2826 0.2658 -1.063 0.28763
## PstatusT -0.1019 0.4029 -0.253 0.80027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 406.92 on 377 degrees of freedom
## AIC: 416.92
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) famrel goout romanticyes PstatusT
## -1.8375812 -0.3731507 0.7959347 -0.2826057 -0.1019220
#compute odds ratios (OR)
OR <- coef(m) %>% exp
#compute confidence intervals (CI)
CI <- confint(m)
## Waiting for profiling to be done...
#print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1592020 -3.2514939 -0.4772475
## famrel 0.6885615 -0.6385795 -0.1114287
## goout 2.2165118 0.5699984 1.0354467
## romanticyes 0.7538169 -0.8126779 0.2317825
## PstatusT 0.9031000 -0.8753200 0.7157342
In my model we see that only famrel and goout variables are significant. Famrel coeffcient is -0.3731507 which according to the scale of the variable means that better family relations lead to less alcohol use. Goout coefficient is 0.7959347 which means that often go out with friens leads to higher alcohol consumotion. So my hypothesis regarding these variables are testified while regarding status of parents and romantic relations not. <<<<<<< HEAD Odds show that go out twice increases the probability of alcohol consumption. Good family relations decrease the alcohol consumption by 68%, romantic relations also decrease the alcohol by 75% as well as higher status pf parents by 90%. ======= Odds show a higher than 1 probability of success (high alcohol consumption) for goout. Odds for presence of famreal is also higher (97.5% interval) than absence. >>>>>>> 1ec0b112ee3f4f518ef03287b10fa255ad0529af
#Binary predictions
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
#add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
#use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = "high_use")
#see the last ten original classes, predicted probabilities, and class predictions
select(alc, famrel, goout, high_use, probability, prediction) %>% tail(10)
## famrel goout high_use probability prediction
## 373 4 2 FALSE 0.13702332 high_use
## 374 5 3 TRUE 0.19506126 high_use
## 375 5 3 FALSE 0.19506126 high_use
## 376 4 3 FALSE 0.20967135 high_use
## 377 5 2 FALSE 0.09855474 high_use
## 378 4 4 FALSE 0.43822544 high_use
## 379 2 2 FALSE 0.25087798 high_use
## 380 1 1 FALSE 0.17994514 high_use
## 381 2 5 TRUE 0.78480132 high_use
## 382 4 1 TRUE 0.06684647 high_use
#tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$high_use)
## prediction
## high_use FALSE TRUE
## FALSE 268 0
## TRUE 0 114
# access dplyr and ggplot2
library(dplyr); library(ggplot2)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g6 <- ggplot(alc, aes(x = "high_use", y = "probability"))
# define the geom as points and draw the plot
g6+geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table%>%addmargins
## prediction
## high_use high_use Sum
## FALSE 0.7015707 0.7015707
## TRUE 0.2984293 0.2984293
## Sum 1.0000000 1.0000000
#accuracy and loss function
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7015707
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2513089
I construct a ‘confusion mattrix’ which describes the performance of my model with such predictors as famrel and goout. We see that the majority of cases (268) are cases in which we predicted False and are actually False. There are no cases (0) where we predicted True, but actual is False (so no Type I error) and no cases where we predicted False, True in actual (no Type II error). And there are 114 cases where we predicted True, and True in actual. So we can conlude that my model well predicts the target variable. Next I computed the total proportion of inaccurately classified individuals. We see that the average number of wrong predictions is 0.2984, while the correct predictions is 0.7015707. So the model predicts quite good.
#Cross-validation
#define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
#compute the average number of wrong predictions in the (training) data
#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
#average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2539267
# 10-fold cross-validation
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
alc.glm <- glm(high_use ~ famrel+goout, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2460733 0.2460733
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
## [1] 0.2774869 0.2674543
Here a made a cross-validation from the DataCamp + made myself a 10-fold cross-validation. My model has a little bit higher prediction error using 10-fold cross-validation) compared to the model introduced in DataCamp (which had about 0.26 error). So it is not better.
#cross-validation with 6 predictors
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
alc.glm <- glm(high_use ~ famrel+goout+sex+absences+freetime+paid, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2225131 0.2205532
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
## [1] 0.2303665 0.2265426
#cross-validation with 4 predictors: sex, absences, freetime, paid
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
alc.glm <- glm(high_use ~ sex+absences+freetime+paid, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2591623 0.2594707
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
## [1] 0.2617801 0.2618692
#cross-validation with 4 predictors: famrel, goout, absences, freetime
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
alc.glm <- glm(high_use ~ famrel+goout+absences+freetime, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2486911 0.2479510
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
## [1] 0.2408377 0.2363491
#cross-validation with 2 predictors: famrel, absences
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
alc.glm <- glm(high_use ~ famrel+absences, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2931937 0.2938516
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
## [1] 0.3010471 0.3036581
Here I Perform cross-validation to compare the performance of different logistic regression models. I first test model with such predictors as famrel, goout, sex, absences, freetime, paid. It is guite a good model since the errors are lower than before and the training error is less than the test error. Now let’s try to decrease the number of predictors. I have tried the model with 4 predictors: sex, absences, freetime, paid. Here again errors are higher and test erroes are higher. I also tried model with such 4 predictors: famrel, goout, absences, freetime. Here errors are higher than in model with 6 predictors but lower than in the previous model. Also testing errors in 10 validation are less than in training, so the model is quite good. In the end I am trying the model with 2 predictors:family relations and absences. Errors are the highest but the testing ones are less than training.
Conclusion:The higher number of the predictors result in the lower errors of the model since the more factors we have the much our model explains. ***
## Data overview
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
#variables distribution
p <- pairs(Boston)
p
## NULL
#calculate the correlation matrix and round it
cor_matrix<-cor(Boston)
#print the correlation matrix
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## ptratio black lstat medv
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046
## zn -0.3916785 0.17552032 -0.4129946 0.3604453
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252
## chas -0.1215152 0.04878848 -0.0539293 0.1752602
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208
## rm -0.3555015 0.12806864 -0.6138083 0.6953599
## age 0.2615150 -0.27353398 0.6023385 -0.3769546
## dis -0.2324705 0.29151167 -0.4969958 0.2499287
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867
## black -0.1773833 1.00000000 -0.3660869 0.3334608
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627
## medv -0.5077867 0.33346082 -0.7376627 1.0000000
#visualize the correlation matrix
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## ── Attaching packages ──────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.3
## ✔ readr 1.3.1 ✔ stringr 1.4.0
## ✔ tibble 2.1.3 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ MASS::select() masks dplyr::select()
corrplot(cor_matrix, method="circle")
corrplot(cor_matrix, method="circle", type = "upper",cl.pos = "b", tl.pos = "d",tl.cex = 0.6)%>%round()
## crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
## crim 1 0 0 0 0 0 0 0 1 1 0 0 0 0
## zn 0 1 -1 0 -1 0 -1 1 0 0 0 0 0 0
## indus 0 -1 1 0 1 0 1 -1 1 1 0 0 1 0
## chas 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## nox 0 -1 1 0 1 0 1 -1 1 1 0 0 1 0
## rm 0 0 0 0 0 1 0 0 0 0 0 0 -1 1
## age 0 -1 1 0 1 0 1 -1 0 1 0 0 1 0
## dis 0 1 -1 0 -1 0 -1 1 0 -1 0 0 0 0
## rad 1 0 1 0 1 0 0 0 1 1 0 0 0 0
## tax 1 0 1 0 1 0 1 -1 1 1 0 0 1 0
## ptratio 0 0 0 0 0 0 0 0 0 0 1 0 0 -1
## black 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## lstat 0 0 1 0 1 -1 1 0 0 1 0 0 1 -1
## medv 0 0 0 0 0 1 0 0 0 0 -1 0 -1 1
This data has 506 obs. of 14 variables which contain information about different indicators of Boston city as per capita crime rate, index of accessibility to radial highways and etc. In the fist plot we seen the distibution of variables. It looks like only rm has a normal distribution while crime, chas, lstat are shifted to the left, age to the right and etc. In the second plot positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. For example, the is strong positive correlation between such variables as crime and rad and tax, between indus and nox, age and tax. Negative correlations between dis and nox,age and dis, medv and lstat.
## Scaling and factor variable
#center and standardize variables
boston_scaled <- scale (Boston)
#summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
#class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
#change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
#create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#create a categorical variable 'crime'
vector <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=vector)
#look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
#add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
## Train and test sets
#number of rows in the Boston dataset
n <- nrow(boston_scaled)
#choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
#create train set
train <- boston_scaled[ind,]
#create test set
test <- boston_scaled[-ind,]
#save the correct classes from test data
correct_classes <- c(test$crime)
#remove the crime variable from test data
test <- dplyr::select(test, -crime)
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 -0.487 0.456 -0.487 -0.487 ...
## $ indus : num -0.164 -0.548 -0.769 1.015 1.015 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.0664 -0.5324 -1.0675 1.1935 0.3651 ...
## $ rm : num -0.508 -0.287 2.81 -1.184 -1.608 ...
## $ age : num 0.697 -0.884 -2.138 1.116 1.116 ...
## $ dis : num -0.633 0.77 2.428 -1.095 -1.047 ...
## $ rad : num -0.408 -0.522 -0.293 1.66 1.66 ...
## $ tax : num 0.141 -0.719 -0.464 1.529 1.529 ...
## $ ptratio: num -0.303 0.529 0.298 0.806 0.806 ...
## $ black : num -0.129 0.441 0.441 0.441 -1.596 ...
## $ lstat : num 0.4351 0.0192 -1.2761 2.5118 1.04 ...
## $ medv : num -0.4602 -0.0362 2.2036 -1.9063 -0.6777 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 2 2 3 4 4 3 2 2 1 1 ...
str(test)
## 'data.frame': 102 obs. of 13 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.437 -0.437 -0.755 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.144 -0.144 -0.481 ...
## $ rm : num 0.413 0.194 -0.303 -0.831 -0.631 ...
## $ age : num -0.12 0.367 1.116 0.939 -0.255 ...
## $ dis : num 0.14007 0.55661 0.18044 -0.00372 -0.1981 ...
## $ rad : num -0.982 -0.867 -0.637 -0.637 -0.522 ...
## $ tax : num -0.666 -0.986 -0.601 -0.601 -0.767 ...
## $ ptratio: num -1.458 -0.303 1.175 1.175 0.344 ...
## $ black : num 0.441 0.441 0.22 0.023 0.229 ...
## $ lstat : num -1.0745 -0.492 0.0542 0.7978 -0.1741 ...
## $ medv : num 0.16 -0.101 -0.873 -1.026 -0.275 ...
Here we scale the data which is an operation when we subtract the column means from the corresponding columns and divide the difference with standard deviation. It helps us to have normal distribution of variables later used in clastering. When we create a factor variable’crim’ and use the quantiles as the break points to the variable.Later we divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# Linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
#print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2351485 0.2549505 0.2648515
##
## Group means:
## zn indus chas nox rm
## low 1.00934512 -0.9736241 -0.11325431 -0.9213749 0.50195010
## med_low -0.07494267 -0.3334399 -0.02367011 -0.5884727 -0.14427662
## med_high -0.40231848 0.1875300 0.26281077 0.3494989 0.07661588
## high -0.48724019 1.0170108 -0.05155709 1.0517559 -0.43169546
## age dis rad tax ptratio black
## low -0.9282595 0.9308157 -0.6953345 -0.7505153 -0.4413661 0.3817192
## med_low -0.3711903 0.4218996 -0.5478716 -0.5189255 -0.1598469 0.3537603
## med_high 0.3624576 -0.3295288 -0.4132128 -0.3076403 -0.1512178 0.1027392
## high 0.8303935 -0.8673402 1.6392096 1.5148289 0.7820356 -0.7303275
## lstat medv
## low -0.813651578 0.60147630
## med_low -0.138945888 0.01245632
## med_high -0.001299083 0.13714844
## high 0.871741694 -0.68023269
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08491809 0.64279328 -0.85088918
## indus 0.04290387 -0.39665985 0.46508218
## chas -0.08994822 -0.09378806 0.05501015
## nox 0.36904148 -0.65314699 -1.46336270
## rm -0.05873752 -0.06723159 -0.23122514
## age 0.27113459 -0.28679108 -0.03400131
## dis -0.08340764 -0.19048377 0.21121822
## rad 3.07368846 0.96913201 0.35207051
## tax 0.08433696 0.08807308 0.21354115
## ptratio 0.15532552 -0.09750022 -0.56597070
## black -0.12081179 0.01582634 0.18825705
## lstat 0.22322237 -0.16737411 0.25291821
## medv 0.19557595 -0.29432675 -0.34537351
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9498 0.0379 0.0123
#the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#target classes as numeric
classes <- as.numeric(train$crime)
#plot the lda results
plot(lda.fit, dimen = 2,col = classes,pch = classes)
lda.arrows(lda.fit, myscale = 2)
We have 4 clasters, so train data was devided in 25% 4 times with crime as target variable. In the plot we see that three clasters are overlapping while the 4rd one in quite far for them. We aslo see that such variables as zn, nox, rad, ptratio have a big impact on the model
# Predictors
#create train set
train <- boston_scaled[ind,]
#create test set
test <- boston_scaled[-ind,]
#save the correct classes from test data
correct_classes <- c(test$crime)
#remove the crime variable from test data
test <- dplyr::select(test, -crime)
#predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## 1 15 9 4 0
## 2 6 13 12 0
## 3 0 2 20 1
## 4 0 0 0 20
#target classes as numeric
classes <- as.numeric(correct_classes)
#plot the lda results
plot(lda.fit, dimen = 2,col = classes,pch = classes)
lda.arrows(lda.fit, myscale = 2)
In the table we see the relation between the correct classes and the predicted ones. The classifier did not predict the crime rates correctly since predicted numbers are higher than correct. I also tried to vizualize the LDA results for crime in test data and we see that everything is in a mess.
## distances
library(MASS)
data('Boston')
#scale the data
boston_scaled <- scale (Boston)
#euclidean distance matrix
dist_eu <- (boston_scaled)
#look at the summary of the distances
summary(dist_eu)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
#manhattan distance matrix
dist_man <- dist(boston_scaled, method = "manhattan")
#look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering
km <-kmeans(boston_scaled, centers = 1)
km <-kmeans(boston_scaled, centers = 3)
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col=km$cluster)
#determine K
set.seed(123)
library(ggplot2)
#determine the number of clusters
k_max <- 10
#calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
#visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
#k-means clustering
km <-kmeans(boston_scaled, centers = 2)
Based on the calculated distance meajure the k-classtering was maded. On the graph we see a significant change in point 2 - so the optimal number of clastters is 2. On the plot we see the vizualization for 2 clasters - which part of data where belongs.
## 3D plot
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# read data
newhuman <- read.csv( file="data/newhuman.csv", row.names = 1)
summary(newhuman)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(dplyr)
# Access GGally
library(GGally)
library(corrplot)
# visualize the 'human_' variables
ggpairs(newhuman)
# compute the correlation matrix and visualize it with corrplot
cor(newhuman)%>%corrplot()
Data has 155 observations and 8 variables. I also removed the country variable from it before in R script. The variable Edu.Exp is normally distributed and variable Edu2.FM has an almost normal distribution. Labo.FM , Life.Exp are very moved to the right while other variables to the left. In the plot, positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the correlation matrix, we see the strong positive correlation between such variables as Mat.Mor and Edu2.FM, Mat.Mor and Edu.Exp, Ado.Birth and Life.Exp and etc. Strong negative correlations are between Mat.Mor and Ado.Birth, Life.Exp and Edu.Exp, etc. Such a variable as Parli.F does not correlate with any others.
## PCA on the not standardized human data.
# perform principal component analysis (with the SVD method)
pca_newhuman <- prcomp(newhuman)
# create and print out a summary of pca_human
s <- summary(pca_newhuman)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_newhuman, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
The PCA on not standardized data is a mess. PC1 exmplains 100% while PC2 0%. The analysis is wrong. Let’s strandartize data and run PCA again.
## PCA on the standardized human data.
# standardize the variables
human_std <- scale(newhuman)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The results with standardized data look fine. PC1 captures 53.6% of the variance in the data while PC2 16.2% variance. They are uncorrelated and PC2 is definitely less important in terms of data capture. Such variables as Adolescent birth rate and Maternal mortality ratio are highly positively correlated with each other and they contribute to the dimensions of PC1. There is no correlation between these two variables and Labo.FM and Parli.F which contribute to the dimensions of PC2. Labo.FM and Parli.F also do not correlate with the variables as Edu.Exp, GNI, Life.Exp and Edu2.FM. However Expected Years of Education, Gross National Income, Life Expectancy at Birth and Population with Secondary Education are all correlated with each other.
library(FactoMineR)
library(ggplot2)
library(dplyr)
library(tidyr)
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
The tea data has 300 obs. of 36 variables. In the plot we see that variables are categorical that is why we apply multiple correspondence analysis. Let’s interpret the MCA results. In Eigenvalues we got 11 dimensions which are quite a lot. The Dim1 captures the most variance while the Dim11 the least. Later we see Categories summary only for the first three dimensions as the most important. Dim 1: the highest contribution of such variables as tea bag and unpackaged - they also have the highest correlation in comparison with other variables. Dim2: the highest contribution and correlation for tea bag+unpackaged. Dim 3: other contribute and correlate the most. In the Categorical variables, we see the correlation between variables and dimensions. The strongest are between how variable and Dim 1, Dim 2 and between where and Dim 1, Dim 2. Also strong between How and Dim 2, Dim 3. In the plot, we see that Dim 1 captures only 15,24% of the variance in data while Dim 2 captures 14,23%. how similar are the variables based on the distance between them. As such, such variables as tea bag, chain store are very close, unpacked and tea shop also close. Other is distant from most of the variables.